Machine learning estimation of crude oil viscosity as function of API, temperature, and oil composition: Model optimization and design space

Measurement of crude oil viscosity is critical for reservoir simulators. Computational modeling is a useful tool for correlating crude oil viscosity with reservoir conditions such as pressure, temperature, and fluid composition. In this work, multiple distinct models are applied to the available dataset to predict heavy-oil viscosity as a function of a variety of process parameters and oil properties. The computational techniques used are Decision Tree (DT), Multilayer Perceptron (MLP), and General Regression Neural Network (GRNN), which were applied to estimate the viscosity of heavy crude oil samples collected from Middle Eastern oil fields. The firefly algorithm (FA) was employed to optimize the hyper-parameters of the machine learning models. The RMSE values for the final DT, MLP, and GRNN models are 40.52, 25.08, and 30.83, respectively, and the corresponding R2-scores are 0.921, 0.978, and 0.933. Based on these and other criteria, MLP is chosen as the best model in this study for estimating crude oil viscosity.


Introduction
In simulations of oil flow through different media, such as wells, pipelines, and processing units, the viscosity of the oil plays a crucial role, and the accuracy of the simulation depends on the accuracy of the viscosity determination. Robust and reliable models are therefore required to estimate the viscosity of oils (e.g., crude oils) from a wide range of sources. Gas is sometimes dissolved in the oil, which makes viscosity estimation challenging. Khemka et al. [1] proposed a method for viscosity modeling of light crude oils containing dissolved gas. They used one-parameter friction theory to estimate the viscosity of oil under gas injection. The accuracy of the viscosity prediction is of great importance, because unrealistic models introduce errors into simulations of oil behavior, such as those in the reservoir.
Because crude oils contain a range of hydrocarbons, realistic models should take the composition of the crude oil into account when estimating viscosity [2]. The model of Kamel et al. outperformed other viscosity estimators such as corresponding-state and EoS-based methods. The statistical analysis of their model revealed that the average relative error of the fitting was 3.8%. Developing holistic models for estimating the viscosity of oils from different sources is demanding, and advanced methods such as machine learning (ML) could be employed for this application. Few ML techniques have been developed for predicting crude oil properties from compositional data, and this research gap remains to be addressed. Conventional analytical approaches are losing ground to ML techniques in the scientific community owing to the power of ML methods in data analytics applications [3]. Different versions of artificial neural networks (NN), models based on decision trees, and other linear and non-linear models are all examples of these methods. ML models can now be applied to any problem with a set of inputs and a set of desired outputs [4]. Using a wide variety of methods, these models determine whether there is a relationship between variables [5-7]. In this research, three methods are implemented for estimating crude oil viscosity from compositional data: Multilayer Perceptron (MLP), Decision Tree (DT), and General Regression Neural Network (GRNN) [8].
The term "multilayer perceptron" (MLP) describes a specific type of neural network that consists of several layers of perceptrons. MLPs are a type of feed-forward artificial neural network in which information passes from the input towards the output. An MLP consists of at least three layers: an input layer, one or more hidden layers, and an output layer. Nodes in the input layer do not apply an activation; rather, they stand in for the actual data point. A d-dimensional vector representing the data point therefore results in d nodes in the input layer [9,10].
The GRNN is another neural-network-based model, built on the radial basis function (RBF) network. It uses a probabilistic framework to model the dependent variable of a regression function. Owing to this probabilistic construction, the GRNN, unlike other neural networks, is not prone to becoming trapped in local optima [11].
A decision tree (DT) is an ML technique used to solve classification and regression problems. The decision tree has advantages over other classification systems because it uses a hierarchical decision-making framework rather than merely grouping features (or bands) together. The DT thus provides a hierarchical and interpretable paradigm for many kinds of machine learning problems. A prediction starts at the root of the DT and moves down the tree according to the value of the feature tested at each node. This procedure is repeated until a leaf node is reached [8,12-14].
The main objective of the current study is to develop a machine learning strategy for estimating crude oil viscosity from compositional data. For the first time, the Multilayer Perceptron (MLP), Decision Tree (DT), and GRNN methods are used and tuned to estimate viscosity as the target parameter, with the oil composition and physical properties as the input parameters. Statistical analysis is then performed to evaluate the performance of the tuned machine learning models and to select the best one.

Data set of crude oil
We used a dataset derived from the research described in [2], which reports measurements and correlations of heavy crude oil viscosity versus a number of input parameters. The data are divided into a training subset and a testing subset. The collection includes an analysis of 28 samples of heavy crude oil from the Middle East. For the construction of the model, 196 separate viscosity measurements were collected at temperatures ranging from 20 to 80 degrees Celsius. A further 47 viscosity measurements were performed to validate the model. In [2], previously reported methods for estimating the viscosity of heavy oils were also validated against the composition and viscosity results, and heavy-oil viscosities were predicted using these methods. The data used in this work are listed in Tables 1 and 2.

Methods of computations
In this research, the data are first pre-processed using Cook's distance and Min-Max normalization, and the prepared data sets are then tested with the models in order to obtain optimal configurations. For this, we use the metaheuristic FA (firefly algorithm).
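The two pre-processing steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: for brevity it assumes a single predictor and an ordinary least-squares fit when computing Cook's distance, and the 4/n cut-off in the hypothetical helper `flag_outliers` is a common rule of thumb rather than a threshold stated in this work.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cooks_distance(x, y):
    """Cook's distance of each point for a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r * r for r in residuals) / (n - p)
    distances = []
    for xi, r in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_mean) ** 2 / sxx
        distances.append((r * r / (p * mse)) * leverage / (1.0 - leverage) ** 2)
    return distances


def flag_outliers(x, y, threshold=None):
    """Indices whose Cook's distance exceeds the threshold.

    The default threshold of 4/n is an assumed rule of thumb.
    """
    distances = cooks_distance(x, y)
    if threshold is None:
        threshold = 4.0 / len(x)
    return [i for i, d in enumerate(distances) if d > threshold]
```

In practice the flagged rows would be dropped first and each remaining input column would then be passed through `min_max_normalize`.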

PLOS ONE
Simulation of oil properties

Decision tree
DT is a widely accepted learning technique that can address a variety of problems. A DT is made up of three parts: a root (start) node, several internal (decision) nodes, and several leaf (terminal) nodes. The model's output is represented by the leaf nodes, while new data enter the tree at its root node [15]. The decision nodes lie between the root and the leaf nodes. Data start at the root node and travel through the intermediate nodes until reaching a leaf node. The algorithm receives data as input and constructs a tree by splitting, pruning, and terminating branches [12-14,16,17]. These steps start at the root node and continue until a stopping condition is met. Fig 2 shows a simplified conceptual decision tree.
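The splitting and traversal steps described above can be sketched as a small regression tree. This is an illustrative sketch under stated assumptions, not the configuration used in this work: splits are chosen to minimize the summed squared error of the two children, growth stops at an assumed `max_depth` or `min_samples`, and pruning is omitted.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature_index, threshold) that lowers the SSE most, or None."""
    best, best_score = None, sse(targets)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (j, t), score
    return best


def build_tree(rows, targets, max_depth=3, min_samples=2):
    """Grow the tree recursively from the root; leaves store the mean target."""
    split = None
    if max_depth > 0 and len(rows) >= min_samples:
        split = best_split(rows, targets)
    if split is None:
        return {"leaf": mean(targets)}
    j, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           max_depth - 1, min_samples),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            max_depth - 1, min_samples),
    }


def predict(tree, row):
    """Walk from the root through decision nodes until a leaf is reached."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]
```

For example, a tree built on a single temperature column would send a query down the left or right branch of each decision node according to whether the temperature is at most the stored threshold.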

Multilayer perceptron and GRNN
The concept of artificial neural networks was conceived in 1943 [18]. The perceptron, the first functional artificial neural network, was unveiled in 1958 [19]. The use of neural networks has increased in prominence since 1986 [20]. Neural networks are modelled after the nervous system and therefore use neurons as their fundamental building element. A variety of neural networks can be formed depending on the connections, the neuron model, and the weight modification method [10,21]. The Multilayer Perceptron (MLP), a type of artificial neural network (ANN), can be employed to capture possible hidden correlations between the input and output data of a process [8,10,22]. The size of the hidden layers can be varied and tuned according to the complexity of the task. The artificial neurons of the MLP are structured in a three-layered network [8,10,23].
The input to each neuron is a weighted sum of the outputs of the previous layer [22]. The activation function, f(z), can be chosen from a number of continuously differentiable functions or the more modern ReLU, which is widely used in deep learning [8,24,25].
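As a minimal sketch of the neuron computation referred to above, the block below assumes the standard form z = w_1 x_1 + ... + w_d x_d + b followed by f(z), with ReLU in the hidden layer and a linear output neuron for regression. The equation of the original work is not reproduced here, and the function names and layer sizes are illustrative assumptions.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes negative ones."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer.

    Each neuron computes activation(sum_i w_i * x_i + b); `weights` holds
    one row of input weights per neuron.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass through one hidden ReLU layer and a linear output neuron."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, lambda z: z)[0]
```

Training would adjust the weights and biases to reduce the prediction error, and the number of hidden neurons is the kind of hyper-parameter that an optimizer can tune.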
The GRNN model is a type of NN based on the radial basis function (RBF). It models the dependent variable in a regression problem using a probabilistic framework. Because of this probabilistic design, the GRNN is less vulnerable to local optima than other neural networks [26].
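A GRNN prediction can be sketched as a kernel-weighted average of the training targets. The block below is a minimal sketch assuming a Gaussian radial basis function and a single smoothing parameter `sigma`; the value of `sigma` and the function name are illustrative, not taken from this work.

```python
import math


def grnn_predict(train_x, train_y, query, sigma=0.5):
    """Predict the target at `query` as an RBF-weighted mean of train_y.

    Each training point gets a weight exp(-d^2 / (2 sigma^2)), where d is
    its Euclidean distance to the query.
    """
    weights = []
    for xi in train_x:
        dist_sq = sum((a - b) ** 2 for a, b in zip(xi, query))
        weights.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

Because the prediction is computed directly from the stored training data, no iterative weight training is involved. A small `sigma` follows the nearest training points closely, while a large `sigma` approaches the global mean.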

Firefly algorithm (FA) optimization approach
The firefly algorithm (FA) is a meta-heuristic optimization algorithm that takes its name and inspiration from the flashing light of fireflies. The algorithm has many similarities to other swarm intelligence approaches such as particle swarm optimization (PSO) and bacterial foraging optimization (BFO), but is easier to understand and implement. FA can locate global and local optima simultaneously. Yang et al. developed and published this algorithm in [27-29]. Its primary advantage is that it is based on global communication among the swarming particles (i.e., fireflies), which appears to make it more successful in multi-objective optimization. Yang et al. [29] discuss the theoretical and technical aspects of the method in greater detail [30].
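The basic firefly update can be sketched as follows. This is a minimal sketch of the commonly described form of the algorithm, in which each firefly moves towards every brighter one with attractiveness beta0 * exp(-gamma * r^2) plus a random step. The parameter values, the decay of the random step, and the scaling of distances by the search range are assumptions made for this illustration, not settings reported in this work.

```python
import math
import random


def firefly_minimize(objective, bounds, n_fireflies=15, n_iter=100,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Minimize `objective` over a box given as a list of (low, high) pairs."""
    rng = random.Random(seed)
    dim = len(bounds)
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds]
             for _ in range(n_fireflies)]
    cost = [objective(p) for p in swarm]
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # firefly j is brighter (lower cost)
                    # Squared distance, scaled by the search range (assumption).
                    r_sq = sum(
                        ((swarm[i][d] - swarm[j][d]) / (bounds[d][1] - bounds[d][0])) ** 2
                        for d in range(dim)
                    )
                    beta = beta0 * math.exp(-gamma * r_sq)
                    for d in range(dim):
                        lo, hi = bounds[d]
                        step = alpha * (rng.random() - 0.5) * (hi - lo)
                        moved = swarm[i][d] + beta * (swarm[j][d] - swarm[i][d]) + step
                        swarm[i][d] = min(hi, max(lo, moved))  # stay inside the box
                    cost[i] = objective(swarm[i])
        alpha *= 0.97  # gradually shrink the random step
    best = min(range(n_fireflies), key=lambda k: cost[k])
    return swarm[best], cost[best]
```

For hyper-parameter tuning, `objective` would map a candidate hyper-parameter vector (for example a tree depth or a GRNN smoothing parameter) to a validation error, and `bounds` would give the allowed range of each hyper-parameter.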

Results and discussion
As mentioned in the description of the proposed method, the data are screened for outliers using Cook's distance before normalization, and the result is shown in Fig 3. This figure shows that only 5% of the data are identified as outliers and should be removed from the data set to improve the modeling results. After pre-processing of the dataset, the models are optimized and their hyper-parameters tuned with the FA described earlier to obtain the final models. The R2-scores and error rates of these models are reported in Tables 3 and 4. The MLP model estimates oil viscosity more accurately than the DT and GRNN models. The statistical parameters, including R2, MAE, RMSE, and MAPE, confirm the accuracy of the tuned MLP model for this particular application in petroleum engineering.
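The four statistical parameters named above can be computed as in the following minimal sketch. It uses the usual textbook definitions; expressing MAPE in percent is an assumption, since the convention used in this work is not stated here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE, and MAPE (in percent) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # MAPE is undefined if any true value is zero.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse, "MAPE": mape}
```

Computing the metrics separately on the training and test subsets makes it possible to compare how well each model generalizes.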
In addition, the expected and predicted values are compared in Figs 4-6, where the blue points are the training data and the red points are the test data. The comparison shows that the models perform very similarly on the training data, but the MLP model is the most accurate on the test data. It is therefore considered the best model, and the remaining analyses are carried out with it. The residuals of this final model are shown in Fig 7. Using the MLP model tuned with FA, a viscosity analysis was performed, and the results are illustrated in Figs 8-13 as 3D and 2D plots. As seen, temperature has the most significant effect on the variation of crude oil viscosity. After temperature, the density and API of the oil also have significant effects on viscosity. The variations of viscosity with molecular weight and oil composition are not substantial compared with the other parameters. The results agree with the previously reported correlations for viscosity estimation from compositional data [2].

Conclusion
In petroleum science, measuring the viscosity of heavy crude oil is crucial, particularly as an input to reservoir simulators. In this study, multiple distinct models are used to predict the viscosity of heavy oil from the available data. The Decision Tree (DT), MLP, and GRNN models are used, and the firefly algorithm (FA) is used to optimize their hyper-parameters. For the final DT, MLP, and GRNN models, the RMSE values are 40.52, 25.08, and 30.83, respectively, and the respective R2-scores are 0.921, 0.978, and 0.933. MLP was selected as the best model in this study for estimating oil viscosity from compositional data. Compared with the research reported in [2], the MLP achieves an almost equal R2 on the test set but better results for the other metrics in the test phase. This shows the effect of optimizing hyper-parameters and removing outliers on obtaining a better and more general model.